Code
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as pxLa industria aérea se caracteriza por una intensa competencia por lo que resulta de suma importancia conocer a los clientes, sus intereses y preferencias. Este trabajo se centra en analizar los datos de encuestas realizadas a pasajeros de una cierta aerolínea para evaluar distintos aspectos y cómo afectan en el nivel de satisfacción, para esto se utilizarán técnicas de Data Science.
El objetivo general del presente trabajo es predecir a través de un modelo de Machine Learning la satisfacción con la mayor asertividad posible, según ciertos contextos y analizar qué variables impactan con mayor correlación en la satisfacción, realizando el entrenamiento de un modelo predictivo
Se plantean como objetivos específicos: * Conocer las características y preferencias de los clientes de acuerdo al género, edad, tipo de cliente y clase de vuelo que utilizan. * Identificar cuáles son los servicios que deben mejorarse, y ver si se asocian a las características generales de los clientes * Analizar las características generales de los vuelos que tienen mayores inconvenientes (demoras en partida/arribo por ej), y ver si la información recolectada es de utilidad para proponer soluciones a los mismos. * Desarrollar un modelo predictivo que permita identificar el nivel de satisfacción de los pasajeros respecto a los servicios brindados
Los datos se obtuvieron del repositorio www.kaggle.com. Se trata de una base de datos estructurados, generada a partir de encuestas realizadas a más de 100k clientes. La misma cuenta con campos que permiten describir las características generales del cliente, como género, edad, tipo de viaje, categoría de pasajero, distancia del vuelo; como así también cuáles son las opiniones del mismo en relación a distintos aspectos del viaje. En este punto, utilizando escalas de Likert, se les consultó sobre distintos aspectos con el grado de satisfacción donde : 0 correspondía a variables donde la respuesta No Aplica, y los puntajes de 1 a 5 indican el nivel de satisfacción de los pasajeros.
Paquetes Numpy, Pandas, Matplotlib, Seaborn, Letsplot y Sklearn
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px#Import Request
import io
import requests
url_train = "https://github.com/jonezequiel92/Airline-Passenger-Satisfaction/raw/main/train.csv"
url_test = "https://github.com/jonezequiel92/Airline-Passenger-Satisfaction/raw/main/test.csv"
s_train=requests.get(url_train).content
s_test=requests.get(url_test).content
df_train=pd.read_csv(io.StringIO(s_train.decode('utf-8')))
df_test=pd.read_csv(io.StringIO(s_test.decode('utf-8')))Set Option
pd.set_option('display.max_columns', 500)
pd.set_option('display.float_format', lambda x: '%.3f' % x)print(df_train.shape)
print(df_test.shape)(103904, 25)
(25976, 25)
Modelado de Datos
df = pd.concat([df_train,df_test])
df.reset_index(inplace=True,drop=True)
df.head()| Unnamed: 0 | id | Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 70172 | Male | Loyal Customer | 13 | Personal Travel | Eco Plus | 460 | 3 | 4 | 3 | 1 | 5 | 3 | 5 | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18.000 | neutral or dissatisfied |
| 1 | 1 | 5047 | Male | disloyal Customer | 25 | Business travel | Business | 235 | 3 | 2 | 3 | 3 | 1 | 3 | 1 | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6.000 | neutral or dissatisfied |
| 2 | 2 | 110028 | Female | Loyal Customer | 26 | Business travel | Business | 1142 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0.000 | satisfied |
| 3 | 3 | 24026 | Female | Loyal Customer | 25 | Business travel | Business | 562 | 2 | 5 | 5 | 5 | 2 | 2 | 2 | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9.000 | neutral or dissatisfied |
| 4 | 4 | 119299 | Male | Loyal Customer | 61 | Business travel | Business | 214 | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0.000 | satisfied |
# se eliminan columnas que no son de interés
df.drop(['Unnamed: 0','id'],axis=1,inplace=True)df.head()| Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | Loyal Customer | 13 | Personal Travel | Eco Plus | 460 | 3 | 4 | 3 | 1 | 5 | 3 | 5 | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18.000 | neutral or dissatisfied |
| 1 | Male | disloyal Customer | 25 | Business travel | Business | 235 | 3 | 2 | 3 | 3 | 1 | 3 | 1 | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6.000 | neutral or dissatisfied |
| 2 | Female | Loyal Customer | 26 | Business travel | Business | 1142 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0.000 | satisfied |
| 3 | Female | Loyal Customer | 25 | Business travel | Business | 562 | 2 | 5 | 5 | 5 | 2 | 2 | 2 | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9.000 | neutral or dissatisfied |
| 4 | Male | Loyal Customer | 61 | Business travel | Business | 214 | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0.000 | satisfied |
df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129880 entries, 0 to 129879
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 129880 non-null object
1 Customer Type 129880 non-null object
2 Age 129880 non-null int64
3 Type of Travel 129880 non-null object
4 Class 129880 non-null object
5 Flight Distance 129880 non-null int64
6 Inflight wifi service 129880 non-null int64
7 Departure/Arrival time convenient 129880 non-null int64
8 Ease of Online booking 129880 non-null int64
9 Gate location 129880 non-null int64
10 Food and drink 129880 non-null int64
11 Online boarding 129880 non-null int64
12 Seat comfort 129880 non-null int64
13 Inflight entertainment 129880 non-null int64
14 On-board service 129880 non-null int64
15 Leg room service 129880 non-null int64
16 Baggage handling 129880 non-null int64
17 Checkin service 129880 non-null int64
18 Inflight service 129880 non-null int64
19 Cleanliness 129880 non-null int64
20 Departure Delay in Minutes 129880 non-null int64
21 Arrival Delay in Minutes 129487 non-null float64
22 satisfaction 129880 non-null object
dtypes: float64(1), int64(17), object(5)
memory usage: 22.8+ MB
# 393 valores nulos en Arrival Delay in Minutes
df.isnull().sum()Gender 0
Customer Type 0
Age 0
Type of Travel 0
Class 0
Flight Distance 0
Inflight wifi service 0
Departure/Arrival time convenient 0
Ease of Online booking 0
Gate location 0
Food and drink 0
Online boarding 0
Seat comfort 0
Inflight entertainment 0
On-board service 0
Leg room service 0
Baggage handling 0
Checkin service 0
Inflight service 0
Cleanliness 0
Departure Delay in Minutes 0
Arrival Delay in Minutes 393
satisfaction 0
dtype: int64
# al ser pocos valores nulos, 393 en 129880 registros se eliminan.
df.dropna(inplace=True)
df.reset_index(inplace=True,drop=True)# se modifica el tipo de dato de Arrival Delay in Minutes de decimal a entero
df['Arrival Delay in Minutes'] = df['Arrival Delay in Minutes'].astype('int64')#Se constata la ausencia de Valores faltantes
df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129487 entries, 0 to 129486
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 129487 non-null object
1 Customer Type 129487 non-null object
2 Age 129487 non-null int64
3 Type of Travel 129487 non-null object
4 Class 129487 non-null object
5 Flight Distance 129487 non-null int64
6 Inflight wifi service 129487 non-null int64
7 Departure/Arrival time convenient 129487 non-null int64
8 Ease of Online booking 129487 non-null int64
9 Gate location 129487 non-null int64
10 Food and drink 129487 non-null int64
11 Online boarding 129487 non-null int64
12 Seat comfort 129487 non-null int64
13 Inflight entertainment 129487 non-null int64
14 On-board service 129487 non-null int64
15 Leg room service 129487 non-null int64
16 Baggage handling 129487 non-null int64
17 Checkin service 129487 non-null int64
18 Inflight service 129487 non-null int64
19 Cleanliness 129487 non-null int64
20 Departure Delay in Minutes 129487 non-null int64
21 Arrival Delay in Minutes 129487 non-null int64
22 satisfaction 129487 non-null object
dtypes: int64(18), object(5)
memory usage: 22.7+ MB
# variables categóricas
df.dtypes[df.dtypes == 'object']Gender object
Customer Type object
Type of Travel object
Class object
satisfaction object
dtype: object
# valores únicos en las variables categóricas
for i in df.dtypes[df.dtypes=='object'].index:
print(i)
print(df[i].unique())Gender
['Male' 'Female']
Customer Type
['Loyal Customer' 'disloyal Customer']
Type of Travel
['Personal Travel' 'Business travel']
Class
['Eco Plus' 'Business' 'Eco']
satisfaction
['neutral or dissatisfied' 'satisfied']
# crear dummies para las variables categóricas
df['transformed_Gender'] = df['Gender'].map({'Male':1,'Female':0})
df['transformed_Customer Type'] = df['Customer Type'].map({'Loyal Customer':1,'disloyal Customer':0})
df['transformed_Type of Travel'] = df['Type of Travel'].map({'Business travel':1,'Personal Travel':0})
df['transformed_Class'] = df['Class'].map({'Business':2,'Eco Plus':1,'Eco':0})
df['transformed_satisfaction'] = df['satisfaction'].map({'satisfied':1,'neutral or dissatisfied':0})df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129487 entries, 0 to 129486
Data columns (total 28 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 129487 non-null object
1 Customer Type 129487 non-null object
2 Age 129487 non-null int64
3 Type of Travel 129487 non-null object
4 Class 129487 non-null object
5 Flight Distance 129487 non-null int64
6 Inflight wifi service 129487 non-null int64
7 Departure/Arrival time convenient 129487 non-null int64
8 Ease of Online booking 129487 non-null int64
9 Gate location 129487 non-null int64
10 Food and drink 129487 non-null int64
11 Online boarding 129487 non-null int64
12 Seat comfort 129487 non-null int64
13 Inflight entertainment 129487 non-null int64
14 On-board service 129487 non-null int64
15 Leg room service 129487 non-null int64
16 Baggage handling 129487 non-null int64
17 Checkin service 129487 non-null int64
18 Inflight service 129487 non-null int64
19 Cleanliness 129487 non-null int64
20 Departure Delay in Minutes 129487 non-null int64
21 Arrival Delay in Minutes 129487 non-null int64
22 satisfaction 129487 non-null object
23 transformed_Gender 129487 non-null int64
24 transformed_Customer Type 129487 non-null int64
25 transformed_Type of Travel 129487 non-null int64
26 transformed_Class 129487 non-null int64
27 transformed_satisfaction 129487 non-null int64
dtypes: int64(23), object(5)
memory usage: 27.7+ MB
Métrica, Funciones y Gráficos
df.head()| Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction | transformed_Gender | transformed_Customer Type | transformed_Type of Travel | transformed_Class | transformed_satisfaction | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | Loyal Customer | 13 | Personal Travel | Eco Plus | 460 | 3 | 4 | 3 | 1 | 5 | 3 | 5 | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18 | neutral or dissatisfied | 1 | 1 | 0 | 1 | 0 |
| 1 | Male | disloyal Customer | 25 | Business travel | Business | 235 | 3 | 2 | 3 | 3 | 1 | 3 | 1 | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6 | neutral or dissatisfied | 1 | 0 | 1 | 2 | 0 |
| 2 | Female | Loyal Customer | 26 | Business travel | Business | 1142 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0 | satisfied | 0 | 1 | 1 | 2 | 1 |
| 3 | Female | Loyal Customer | 25 | Business travel | Business | 562 | 2 | 5 | 5 | 5 | 2 | 2 | 2 | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9 | neutral or dissatisfied | 0 | 1 | 1 | 2 | 0 |
| 4 | Male | Loyal Customer | 61 | Business travel | Business | 214 | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0 | satisfied | 1 | 1 | 1 | 2 | 1 |
df.shape(129487, 28)
df.describe().T[:-5]| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Age | 129487.000 | 39.429 | 15.118 | 7.000 | 27.000 | 40.000 | 51.000 | 85.000 |
| Flight Distance | 129487.000 | 1190.211 | 997.561 | 31.000 | 414.000 | 844.000 | 1744.000 | 4983.000 |
| Inflight wifi service | 129487.000 | 2.729 | 1.329 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Departure/Arrival time convenient | 129487.000 | 3.057 | 1.527 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Ease of Online booking | 129487.000 | 2.757 | 1.402 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Gate location | 129487.000 | 2.977 | 1.279 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Food and drink | 129487.000 | 3.205 | 1.330 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Online boarding | 129487.000 | 3.253 | 1.351 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Seat comfort | 129487.000 | 3.442 | 1.319 | 0.000 | 2.000 | 4.000 | 5.000 | 5.000 |
| Inflight entertainment | 129487.000 | 3.358 | 1.334 | 0.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| On-board service | 129487.000 | 3.383 | 1.287 | 0.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Leg room service | 129487.000 | 3.351 | 1.316 | 0.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Baggage handling | 129487.000 | 3.632 | 1.180 | 1.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Checkin service | 129487.000 | 3.306 | 1.266 | 0.000 | 3.000 | 3.000 | 4.000 | 5.000 |
| Inflight service | 129487.000 | 3.642 | 1.177 | 0.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Cleanliness | 129487.000 | 3.286 | 1.314 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Departure Delay in Minutes | 129487.000 | 14.643 | 37.933 | 0.000 | 0.000 | 0.000 | 12.000 | 1592.000 |
| Arrival Delay in Minutes | 129487.000 | 15.091 | 38.466 | 0.000 | 0.000 | 0.000 | 13.000 | 1584.000 |
#Gráfico de Satisfación
plt.figure(figsize=(6,6))
labels = 'Neutral or dissatisfied', 'Satisfied'
explode = (0, 0.1)
df_group = df.satisfaction.value_counts(normalize=True).mul(100)
df_group.plot.pie(autopct="%.2f", cmap='tab10', labels=labels,explode=explode ,shadow=True).set(title='% de clientes según grado de satisfacción')
plt.show()#Gráfico Univariado
plt.figure(figsize=(6,6))
v,m,g=plt.hist(df['Age'], color='lightblue')
plt.title("Distribución de la edad",size=18)
plt.ylabel("Frecuencia",size=14)
for i, rect in enumerate(g):
posx = rect.get_x()
posy = rect.get_height()
plt.text(posx+0.5, posy + 30, int(v[i]), color='black', fontsize = 8,weight='bold')
plt.grid(color='r', linestyle='dotted', linewidth=1)
plt.show()# grafico para la conclusion 'La proporción de hombres y mujeres encuestados es similar.'
gender_perc = df.Gender.value_counts(normalize=True).mul(100)
fig, ax = plt.subplots(figsize=(6,6))
ax.pie(gender_perc, labels=gender_perc.index, autopct='%1.1f%%', startangle=90)
ax.axis('equal')
plt.title("Proporción según sexo",size=12)
plt.show()class_perc = df['Class'].value_counts(normalize=True).mul(100)
fig, ax = plt.subplots(figsize=(6,6))
ax.pie(class_perc, labels=class_perc.index, autopct='%1.1f%%', startangle=90)
ax.axis('equal')
plt.title("Proporción según clase del viaje",size=12)
plt.show()#Confort de los Asientos
from seaborn import countplot
plt.figure(figsize=(6,6))
ax = countplot(df['Seat comfort'], data=df)
plt.xticks(size =12)
plt.yticks(size =12)
plt.ylabel('Cantidad')
ax.set(xlabel=None)
abs_values = df['Seat comfort'].value_counts(ascending=False).values
for p in ax.patches:
ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+120),weight='bold')
plt.title("Pasajeros según reseña de satisfacción sobre el asiento en vuelo",size=12)
plt.grid(color='r', linestyle='dotted', linewidth=1,axis='y')
plt.show()/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
FutureWarning
list_poll=['Inflight wifi service','Departure/Arrival time convenient','Ease of Online booking','Food and drink','Online boarding','Cleanliness']
colores = ["#EE6055","#60D394","#AAF683","#FFD97D","#FF9B85","#FFFFFF"]
fig, axarr = plt.subplots(2, 3, figsize=(16, 10))
for index,i in enumerate(list_poll):
df[i].value_counts().plot.pie(autopct="%.1f",colors=colores,ax=axarr[int(index/3)][index%3])
sns.set(font_scale = 1.2)
plt.title('Proporciones de las encuestas de satisfacción', fontsize = 14, fontweight = 20)
plt.show()#Relación entre Clase y Edad
fig, ax = plt.subplots(figsize=(6,6))
ax = sns.boxplot(x=df['Class'], y=df['Age'])
ax.set_title('Relación entre Clase y Edad', {'fontsize':14},pad=20)
ax.set(xlabel=None)
ax.set_ylabel('Edad')
plt.grid(color='r', linestyle='dotted', linewidth=1)
plt.show()#Relación entre distancia de viaje y satisfacción
plt.figure(figsize=(6,6))
sns.violinplot(x='satisfaction', y='Flight Distance',data=df, palette = 'colorblind')
plt.title('Relación entre Distancia del viaje y Satisfacción', fontsize = 14, fontweight = 30)
plt.ylabel('Distancia del viaje')
plt.xlabel('')
plt.grid(color='r', linestyle='dotted', linewidth=1)
plt.show()pd.crosstab(df.Class, df.satisfaction)| satisfaction | neutral or dissatisfied | satisfied |
|---|---|---|
| Class | ||
| Business | 18940 | 43050 |
| Eco | 47215 | 10902 |
| Eco Plus | 7070 | 2310 |
pd.crosstab(df['Type of Travel'], df.satisfaction)| satisfaction | neutral or dissatisfied | satisfied |
|---|---|---|
| Type of Travel | ||
| Business travel | 37238 | 52207 |
| Personal Travel | 35987 | 4055 |
df.groupby('Type of Travel')['Flight Distance'].describe()| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Type of Travel | ||||||||
| Business travel | 71655.00 | 1368.29 | 1086.68 | 31.00 | 451.00 | 986.00 | 2143.00 | 4983.00 |
| Personal Travel | 32249.00 | 792.08 | 592.27 | 31.00 | 363.00 | 628.00 | 1023.00 | 4983.00 |
#Gráfico Multivariado
sns.set(font_scale = 1)
plt.figure(figsize=(6,6))
sns.violinplot(x="satisfaction", y="Age", hue="Gender",
data=df,
palette=['#008B8B','#00FFFF'],
split=True,
scale="count")
plt.title('Relación Género y Edad, según satisfacción', fontsize = 14, fontweight = 20)
plt.xlabel('')
plt.ylabel('Edad')
plt.grid(color='r', linestyle='dotted', linewidth=1,axis='y')
plt.show()#Relación entre los tiempos de espera y la satisfacción
plt.figure(figsize=(6,6))
sns.scatterplot(data=df, x='Arrival Delay in Minutes',y='Departure Delay in Minutes',hue='satisfaction',palette='gist_rainbow_r', alpha=0.8)
plt.grid() # agregar una grilla
plt.show()#Gráfico Multivariado
fig = px.violin(df, y="Flight Distance", x="Class", color="satisfaction", box=True,hover_data=df.columns)
fig.show()